Search results for "text corpus"
Showing 10 of 14 documents
A Study on Classification Methods Applied to Sentiment Analysis
2013
Sentiment analysis is a new area of research in data mining that concerns the detection of opinions and/or sentiments in texts. This work focuses on the application and comparison of three classification techniques over a text corpus composed of reviews of commercial products, in order to detect opinions about them. The chosen domain is "perfumes", and the user opinions composing the corpus are written in Italian. The proposed approach is completely data-driven: a Term Frequency / Inverse Document Frequency (TFIDF) term selection procedure has been applied in order to make computation more efficient, to improve the classification results and to manage some issues related to t…
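A minimal sketch of what such a TF-IDF term-selection and classification pipeline might look like with scikit-learn; the placeholder reviews, the max_features cutoff and the LinearSVC classifier are assumptions for illustration, not the paper's exact setup:

```python
# Sketch of a TF-IDF term-selection + classification pipeline, in the
# spirit of the abstract above. Data, tokenizer settings and the choice
# of classifier are assumptions, not the authors' configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder reviews with opinion labels (1 = positive, 0 = negative).
reviews = ["profumo meraviglioso, lo ricomprerei", "odore sgradevole, deluso"]
labels = [1, 0]

# TF-IDF weighting; max_features acts as a crude term-selection step,
# keeping only the highest-scoring terms to make computation more efficient.
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000, sublinear_tf=True),
    LinearSVC(),
)
pipeline.fit(reviews, labels)
print(pipeline.predict(["fragranza eccellente"]))
```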
Analysis and Comparison of Deep Learning Networks for Supporting Sentiment Mining in Text Corpora
2020
In this paper, we tackle the problem of irony and sarcasm detection for the Italian language, to contribute to the enrichment of the sentiment analysis field. We analyze and compare five deep-learning systems. Results show that such systems are highly suitable for the problem, achieving an F1-score of 93% in the best case. Furthermore, we briefly analyze the model architectures in order to choose the best compromise between performance and complexity.
Automatic Dictionary Creation by Sub-symbolic Encoding of Words
2006
This paper describes a technique for the automatic creation of dictionaries using sub-symbolic representations of words in a cross-language context. Semantic relationships among words of two languages are extracted from aligned bilingual text corpora. This feature is obtained by applying the Latent Semantic Analysis technique to the matrices representing term co-occurrences in aligned text fragments. The technique makes it possible to find the “best translation” according to a properly defined geometric distance in an automatically created semantic space. Experiments show a correctness of 95% in the best case.
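A minimal sketch of cross-language LSA over aligned fragments, assuming a toy English–Italian corpus; stacking term-by-fragment matrices and reducing them with TruncatedSVD stands in for the paper's actual pipeline:

```python
# Sketch of cross-language LSA for dictionary induction: terms from both
# languages are rows of one matrix whose columns are aligned fragments;
# SVD projects all terms into a shared semantic space, where the "best
# translation" is the geometrically closest term of the other language.
# The tiny toy corpus is an assumption.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Aligned fragment pairs (English, Italian); real data would be much larger.
en = ["the dog barks", "the cat sleeps", "the dog sleeps"]
it = ["il cane abbaia", "il gatto dorme", "il cane dorme"]

# One term-by-fragment matrix per language, sharing the fragment (column) axis.
vec_en, vec_it = CountVectorizer(), CountVectorizer()
m_en = vec_en.fit_transform(en).T.toarray()
m_it = vec_it.fit_transform(it).T.toarray()

# Stack terms of both languages and reduce with truncated SVD (= LSA).
space = TruncatedSVD(n_components=2).fit_transform(np.vstack([m_en, m_it]))
en_terms = vec_en.get_feature_names_out()
it_terms = vec_it.get_feature_names_out()
en_space, it_space = space[:len(en_terms)], space[len(en_terms):]

# Nearest Italian term to each English term in the shared semantic space.
sims = cosine_similarity(en_space, it_space)
for i, term in enumerate(en_terms):
    print(term, "->", it_terms[sims[i].argmax()])
```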
Syntagmatic and Paradigmatic Associations in Information Retrieval
2003
It is shown that unconscious associative processes taking place in the memory of a searcher during the formulation of a search query in information retrieval — such as the production of free word associations and the generation of synonyms — can be simulated using statistical models that analyze the distribution of words in large text corpora. The free word associations as produced by subjects on presentation of stimulus words can be predicted by applying first-order statistics to the frequencies of word co-occurrences as observed in texts. The generation of synonyms can also be conducted on co-occurrence data but requires second-order statistics. Both approaches are compared and validated …
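A minimal sketch of the first-order versus second-order distinction on a toy corpus; the sentence-window counting and the toy sentences are assumptions made only to contrast the two kinds of statistics, and on data this small the rankings themselves are not meaningful (the abstract's point requires large corpora):

```python
# Sketch contrasting first-order and second-order co-occurrence statistics.
import numpy as np
from itertools import combinations

sentences = [
    "the doctor treats the patient in the hospital",
    "the physician treats the patient in the clinic",
    "the nurse helps the doctor in the hospital",
]

# Count word co-occurrences within a sentence (a crude "window").
vocab = sorted({w for s in sentences for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
cooc = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for a, b in combinations(s.split(), 2):
        cooc[idx[a], idx[b]] += 1
        cooc[idx[b], idx[a]] += 1

def first_order(word, k=3):
    """Free-association-like neighbours: words that co-occur most often."""
    row = cooc[idx[word]]
    return [vocab[i] for i in row.argsort()[::-1][:k]]

def second_order(word, k=3):
    """Synonym-like neighbours: words with similar co-occurrence profiles."""
    row = cooc[idx[word]]
    sims = cooc @ row / (np.linalg.norm(cooc, axis=1) * np.linalg.norm(row) + 1e-9)
    sims[idx[word]] = -1  # exclude the word itself
    return [vocab[i] for i in sims.argsort()[::-1][:k]]

print(first_order("doctor"))   # direct co-occurrence neighbours
print(second_order("doctor"))  # neighbours by similarity of co-occurrence profiles
```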
Methodological Approach for Messages Classification on Twitter Within E-Government Area
2018
The constant growth in the number of Social Media users is a reality of the past few years. Companies, governments and researchers focus on extracting useful data from Social Media. One of the most important things we can extract from the messages transmitted from one user to another is the sentiment—positive, negative or neutral—regarding the subject of the conversation. There are many studies on how to classify these messages, but all of them need a huge amount of already-classified data for training, data that is not available for Romanian-language texts. We present a case study in which we apply a Naive Bayes classifier trained on an English short-text corpus to several thousand Romanian texts. …
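A minimal sketch of the Naive Bayes part only; how the Romanian messages are mapped into the English feature space is not visible in the excerpt, so that step is omitted, and the training texts are placeholders:

```python
# Sketch of a Naive Bayes short-text sentiment classifier trained on
# labelled English texts. The bridging step to Romanian messages (e.g.
# translation) is not described in the excerpt and is left out here.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great service, very helpful", "terrible response, waste of time"]
train_labels = ["positive", "negative"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)
print(clf.predict(["very helpful staff"]))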
Review of Non-English Corpora Annotated for Emotion Classification in Text
2020
In this paper we try to systematize the information about the available corpora for emotion classification in text, for languages other than English, with the goal of finding what approaches could be used for low-resource languages with close to no existing work in the field. We analyze the volume, emotion classification schema, and language of each corpus, as well as the methods employed for data preparation and annotation automation. We systematized twenty-four papers representing the corpora and found that the corpora were mostly for the most spoken world languages: Hindi, Chinese, Turkish, Arabic, Japanese, etc. A typical corpus contained several thousand manually-annotated ent…
A Methodology for Bilingual Lexicon Extraction from Comparable Corpora
2015
Dictionary extraction using parallel corpora is well established. However, for many language pairs parallel corpora are a scarce resource, which is why in the current work we discuss methods for dictionary extraction from comparable corpora. The aim is to push the boundaries of current approaches, which typically utilize correlations between co-occurrence patterns across languages, in several ways: 1) eliminating the need for initial lexicons by using a bootstrapping approach which only requires a few seed translations; 2) implementing a new approach which first establishes alignments between comparable documents across languages, and then computes cross-lingual alignments between wor…
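A minimal sketch of the co-occurrence-correlation baseline that the abstract says current approaches typically rely on; the seed translations, context counts and candidate words are toy assumptions, not data from the paper:

```python
# Sketch of the standard comparable-corpora baseline: a source word is
# represented by its co-occurrence counts over seed-translation pairs, and
# candidate target words are ranked by how similar their counts over the
# same pairs are. All numbers are toy assumptions.
import numpy as np

# Context dimensions correspond, pairwise, to the seed translations
# ("water", "acqua"), ("drink", "bere"), ("hot", "caldo").
src_vec = {"tea": np.array([3.0, 5.0, 4.0])}
tgt_vec = {"tè": np.array([4.0, 5.0, 3.0]),
           "vino": np.array([1.0, 4.0, 0.0])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_translation(word):
    v = src_vec[word]
    return max(tgt_vec, key=lambda t: cosine(v, tgt_vec[t]))

print(best_translation("tea"))  # "tè" ranks above "vino" on these counts
```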
Reflection Assignment as a Tool to Support Students’ Metacognitive Awareness in the Context of Computer-Supported Collaborative Learning
2021
The present study explores the potential of a reflection assignment as a tool for supporting master’s degree students’ metacognitive skills in the context of computer-supported collaborative learning (CSCL). The research question (RQ) is formulated as follows: How does a regularly submitted reflection assignment support the development of students’ individual metacognitive awareness in the context of CSCL? The empirical data is a text corpus (7878 words) extracted from individual students’ (N = 13) reflection assignments (N = 65) submitted during one semester. Qualitative content analysis was employed to analyze the data. The results demonstrate that by the end of the course, the students s…
Graph-based exploration and clustering analysis of semantic spaces
2019
The goal of this study is to demonstrate how network science and graph theory tools and concepts can be effectively used for exploring and comparing semantic spaces of word embeddings and lexical databases. Specifically, we construct semantic networks based on the word2vec representation of words, which is “learnt” from large text corpora (Google news, Amazon reviews), and “human built” word networks derived from the well-known lexical databases: WordNet and Moby Thesaurus. We compare “global” (e.g., degrees, distances, clustering coefficients) and “local” (e.g., most central nodes and community-type dense clusters) characteristics of the considered networks. Our observations suggest that …
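A minimal sketch of turning word embeddings into a semantic network and reading off simple graph characteristics; a tiny Word2Vec model trained on toy sentences stands in for the large pretrained spaces (Google news, Amazon reviews) named above:

```python
# Sketch: build a k-nearest-neighbour graph over word vectors and compute
# basic "global" network characteristics with networkx. Toy data only.
import networkx as nx
from gensim.models import Word2Vec

sentences = [
    ["dog", "barks", "at", "the", "cat"],
    ["cat", "chases", "the", "mouse"],
    ["dog", "chases", "the", "cat"],
]
model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, seed=1)

# Connect each word to its k most similar words in the embedding space.
G = nx.Graph()
k = 2
for word in model.wv.index_to_key:
    for neighbour, _sim in model.wv.most_similar(word, topn=k):
        G.add_edge(word, neighbour)

# Simple "global" characteristics of the resulting semantic network.
print(dict(G.degree()))
print(nx.average_clustering(G))
```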
Supporting Emotion Automatic Detection and Analysis over Real-Life Text Corpora via Deep Learning: Model, Methodology, and Framework
2021
This paper describes an approach for supporting automatic satire detection through an effective deep learning (DL) architecture that has been shown to be useful for addressing sarcasm/irony detection problems. We trained and tested the system on articles derived from two important satiric blogs, Lercio and IlFattoQuotidiano, and from significant Italian newspapers.